In [1]:
from IPython.display import HTML

# Toggle button to hide the code from the notebook
HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')
Out[1]:

Calgary Shared Mobility Pilot Trips Analysis

In [2]:
import datetime as dt
import os
import warnings
from pathlib import Path

import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

%matplotlib inline

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

Intro

On October 31, 2018, the city of Calgary began operating a shared e-bike and e-scooter pilot program.

Initially, approximately 500 dockless electric bicycles, provided by Lime, were available from the beginning of the pilot, with 168,000 trips taken and 210,000 km traveled between then and September 30, 2019 [1].

In the summer of 2019, electric scooters (e-scooters) joined the mix and appeared to be an instant hit. The e-scooters were first made available on July 12, 2019 and were originally slated to remain available until October 31, 2019 [2]. Both Lime and Bird operated the scooter rentals.

Rental data for July 1 to September 30, 2019 was made available through the City of Calgary's open data portal [3].

Using the data, I wanted to attempt to answer the following questions:

  • How far/how long are typical trips?
  • How popular are the bikes/scooters? Which vehicle is most popular?
  • When is the most popular time to rent a scooter? Are there any noticeable trends in rentals based on time of day, date or day of week?
  • Where are the most trips starting and ending?
  • Can we guess how much money has been spent on the scooters/bikes?
  • How does the weather impact rental count?
  • Can we make any assumptions about whether people are using the scooters for travel or just for fun?

The Data

The City of Calgary provided data from 482k trips. All trips occurred between July 1 and September 30, 2019.

Data available included:

  • Vehicle Type: e-scooter or e-bike
  • Start Date: The day of the trip
  • Start Hour: Hour the trip was started in 24-hour clock (e.g., 13 is 1:00 pm-1:59 pm, 17 is 5:00 pm-5:59 pm)
  • Trip Distance in meters (m)
  • Trip Duration in seconds (s)
  • Approximate Lat/Lon of where the trip started and ended
    • This was within a 10,000$m^2$ hexagon to anonymize the data
    • i.e. if two trips started in the same hexagon, they will have the same starting point, even though they could be as far as ~120m apart.
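
The ~120m figure follows from the hexagon geometry. As a quick check, and assuming the anonymization cells are regular hexagons (the data description does not say so explicitly), the side length and the largest possible distance between two points in one cell can be worked out from the area:

```python
import math

HEX_AREA_M2 = 10_000  # cell area stated above

# Area of a regular hexagon with side s: A = (3 * sqrt(3) / 2) * s**2
side_m = math.sqrt(2 * HEX_AREA_M2 / (3 * math.sqrt(3)))

# The centre-to-corner distance equals the side length, and the
# corner-to-opposite-corner distance is twice the side length.
max_separation_m = 2 * side_m

print(f"side length: {side_m:.1f} m")
print(f"max separation: {max_separation_m:.1f} m")
```

Under that assumption, a reported point can be up to about 62m from the true location, and two trips in the same cell can be up to about 124m apart.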

Some of the other columns are somewhat redundant, but helpful for analysis, like naming the hexagon or providing the day of the week etc.

Weather data was obtained from Environment Canada's website: [4]

The temperature ($^{\circ}C$), wind speed (km/h) and weather (Sunny, raining etc.) were available for every hour.

Cleaning

Before starting, I cleaned up the data to make it easier to analyze. The main task was linking the weather data to the scooter data. I also calculated some metrics, such as speed and aerial distance, that are discussed in more detail later. The analysis starts from the cleaned data table.

All cleaning code is available on GitHub

Below is a sample of the final table:

In [3]:
# Read in pre-formatted dataset
project_dir = Path().resolve().parents[0]
file_name = os.path.join(project_dir, 'data', 'final', 'all_data.csv')
all_trips = pd.read_csv(file_name)
all_trips.datetime = pd.to_datetime(all_trips.datetime)
all_trips.start_date = pd.to_datetime(all_trips.start_date)
all_trips.head()
Out[3]:
vehicle_type start_date start_hour start_day start_day_of_week trip_distance trip_duration starting_grid_id ending_grid_id startx ... a_dist travel_efficiency speed a_speed is_weekend is_holiday datetime Temp (°C) Wind Spd (km/h) Weather
0 scooter 2019-08-22 16 Thursday 4 338 129 DN-104 DN-104 -114.071462 ... 62.040324 0.183551 9.432558 1.731358 0 0 2019-08-22 16:00:00 19.4 11.0 Clear
1 scooter 2019-09-13 23 Friday 5 1092 347 DM-103 DM-103 -114.073762 ... 62.040324 0.056813 11.329107 0.643646 0 0 2019-09-13 23:00:00 9.7 9.0 Clear
2 scooter 2019-08-08 10 Thursday 4 2059 547 AL-37 AP-45 -114.255975 ... 1622.721148 0.788111 13.551005 10.679700 0 0 2019-08-08 10:00:00 20.1 10.0 Clear
3 scooter 2019-08-08 11 Thursday 4 158 228 DN-104 DN-104 -114.071462 ... 62.040324 0.392660 2.494737 0.979584 0 0 2019-08-08 11:00:00 22.2 20.0 Clear
4 scooter 2019-07-24 16 Wednesday 3 1009 308 CG-127 CF-128 -114.147194 ... 186.139286 0.184479 11.793506 2.175654 0 0 2019-07-24 16:00:00 21.0 36.0 Clear

5 rows × 23 columns

Also, a quick look at some summary statistics for the dataset:

In [4]:
all_trips[['trip_distance', 'trip_duration']].describe()
Out[4]:
trip_distance trip_duration
count 482021.000000 482021.000000
mean 1846.298045 771.032598
std 1890.667017 809.441301
min 101.000000 31.000000
25% 639.000000 302.000000
50% 1261.000000 503.000000
75% 2335.000000 912.000000
max 56659.000000 9521.000000

So, the average trip was about 1.8 km and took 12 min 51 s. That said, a relatively small number of very long trips are probably pulling the average up; the longest trip was 56.7 km!

The median is probably a better measure of typical use: half of all trips were shorter than about 1.3 km, and half took less than about 8 min 23 s.
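
The minute/second figures above come straight from the summary table. A small sketch of the conversion (the helper name `format_duration` is just for illustration and is not part of the notebook's cleaning code):

```python
def format_duration(seconds):
    """Format a duration in whole seconds as 'M min S s'."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes} min {secs} s"

# Values taken from the describe() output above, rounded to whole units
mean_duration_s, median_duration_s = 771, 503
mean_distance_m, median_distance_m = 1846, 1261

print(format_duration(mean_duration_s))    # 12 min 51 s
print(format_duration(median_duration_s))  # 8 min 23 s
print(f"mean/median distance ratio: {mean_distance_m / median_distance_m:.2f}")
```

A mean well above the median, as seen here for both distance and duration, is what a right-skewed distribution with a long tail of lengthy trips looks like.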

The difference between bicycle trips and scooter trips doesn't seem to be appreciable. Note: the violin plots are built on a random sample of rides.

In [5]:
sample = all_trips.sample(frac=0.05) # plotting all ~482k data points runs too slowly

violin_fig1=px.violin(sample, x='vehicle_type', y='trip_distance', box=True, points="outliers",
          labels={'vehicle_type':'Vehicle', "trip_distance": "Trip Distance (m)"},
          title='Sample Distribution of Trip Distances for Bikes and Scooters')
violin_fig1.show()
In [6]:
violin_fig2=px.violin(sample, x='vehicle_type', y='trip_duration', box=True, points="outliers",          
          labels={'vehicle_type':'Vehicle', "trip_duration": "Trip Duration (s)"},
          title='Sample Distribution of Trip Durations for Bikes and Scooters')
violin_fig2.show()

Total Rentals/Popularity

I wanted to see what the number of rentals per day looked like over the time period. The plot below splits the rentals between e-bicycles and e-scooters.

In [7]:
line_fig1 = px.line(all_trips.groupby(['start_date','vehicle_type']).count().reset_index(), 
               x="start_date", y='a_dist', color='vehicle_type', 
               labels={'a_dist':'Number of Rentals/Day', 'start_date': 'Date', 'vehicle_type': 'Vehicle'},
               title = " Number of Vehicle Rentals per Day over Trial Period")
line_fig1.show()

A few things stand out:

  • There were a few scooter rentals before the official launch date of July 12th, maybe some testing or limited rentals.

  • There's a big jump in rentals towards the end of July. It's worth noting that Lime had 1000 scooters, starting July 12th (don't know if they were all available right away or if they added more), but Bird started operating their fleet of 500 scooters on July 26th [1].

E-Bikes are so 2018!

  • Bikes were decidedly less popular. Removing scooters, it is easier to see that the bike rentals dropped off from around 400/day to 150/day. It seems like the e-scooters are cannibalizing the bicycle rentals.
  • Worth noting that e-bikes will not be back for summer 2020 [5]. This is certainly not surprising, looking at the data.
In [8]:
line_fig2 = px.line(all_trips[all_trips['vehicle_type'] == 'bicycle'].groupby(['start_date','vehicle_type']).count().reset_index(), 
               x="start_date", y='a_dist', color='vehicle_type', 
               labels={'a_dist':'Number of Rentals/Day', 'start_date': 'Date', 'vehicle_type': 'Vehicle'},
               title = " Number of Vehicle Rentals per Day over Trial Period")
line_fig2.show()

Because of this, I'm just going to focus on e-scooters for the remaining analysis.

How are People Using the Scooters?

Initially, I wanted to investigate what usage looks like for the scooters. We expect intuitively that there will be some periodicity to the rental patterns. For instance, there are probably fewer rentals in the middle of the night than during the day.

The following interactive plot shows the rentals per hour, over the entire trial period. Use the selectors to pick a time interval, and the slider to move the date range:

In [9]:
# Just pick Scooters
scooter = all_trips[all_trips['vehicle_type'] == 'scooter'].copy()  # copy so columns can be added later without warnings
In [10]:
scooter2 = scooter.groupby(['datetime']).count().reset_index()


fig3 = go.Figure()
fig3.add_trace(go.Scatter(x=scooter2['datetime'],
                         y=scooter2['a_dist'].values.tolist(), 
               mode = 'lines',
               opacity = 1,
#                line = dict(color = '#17BECF'),
               name = 'Scooter Rentals'))
    
# Set title
fig3.update_layout(
    title_text="Number of Scooter Rentals per Hour",
    xaxis = dict(title = 'Date'),
    yaxis = dict(title = 'Rentals/hr')) 

# Add range slider
fig3.update_layout(
    xaxis=go.layout.XAxis(
        rangeselector=dict(
            buttons=list([              
                dict(count=1,
                     label="1d",
                     step="day",
                     stepmode="todate"),
                dict(count=2,
                     label="2d",
                     step="day",
                     stepmode="todate"),
                dict(count=7,
                     label="7d",
                     step="day",
                     stepmode="todate"),
                dict(count=14,
                     label="14d",
                     step="day",
                     stepmode="todate"),
                dict(count=1,
                     label="1m",
                     step="month",
                     stepmode="todate"),
                dict(count=2,
                     label="2m",
                     step="month",
                     stepmode="todate"),
                dict(step="all")
            ])
        ),
        rangeslider=dict(
            visible=True
        ),
        type="date"
    )
)

fig3.show()

It seems like most rentals occur during the middle of the day. There's a mini spike around 8AM on weekdays, likely corresponding to rides to work. Most rides seem to take place in the afternoon and early evening.

If you scroll around, the most rentals were on September 21 at 5-6pm (hour 17). Not sure what was going on. Possibly the "Stampede Shindig" at Heritage Park? [6]. Let's check a map:

In [11]:
print('Top hour for rentals was: ', str(scooter2.loc[scooter2['vehicle_type'].idxmax(), 'datetime'])[:10])
Top hour for rentals was:  2019-09-21
In [12]:
# Set Mapbox Token
px.set_mapbox_access_token(Path(project_dir, 'data', 'raw', 'mapbox.token').read_text())

# Trips that started in the peak hour (hour 17 = 5:00-5:59 pm)
peak_scooter = scooter[scooter['datetime'] == dt.datetime(2019,9,21,17)]

map1 = px.scatter_mapbox(peak_scooter, lat="endy", lon="endx", width=800, height=800, zoom=11, 
                         labels={'endy': "End Point Latitude", 'endx': "End Point Longitude"},
                         center = {'lat':50.98263, 'lon':-114.10210}, title='Rentals on Sept. 21, 2019: 5-6pm')
map1.show()

Not a single scooter terminated at Heritage Park (the map should have centered on its location).

They seem to be mostly situated downtown, so my guess is that people were going to or from the Beakerhead Festival, which was also that weekend [7]. This is highly speculative!

Also, worth noting it was a nice night for scooter riding:

In [13]:
peak_scooter[['datetime','Temp (°C)', 'Wind Spd (km/h)', 'Weather' ]].head(1)
Out[13]:
datetime Temp (°C) Wind Spd (km/h) Weather
117916 2019-09-21 17:00:00 19.2 8.0 Clear

Rentals by Hour

Exploring the cyclical nature of the rentals some more, I wanted to check whether there are any interesting patterns in rentals based on time of day and on whether it was a weekday, weekend or holiday.

The following plot shows average rentals per hour over the dataset for the different days.
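
The averaging works by dividing the total rentals in each hour by the number of distinct days of that type. Here is a minimal sketch of that normalization on a handful of made-up trips (the toy data and variable names are hypothetical, not taken from the dataset):

```python
from collections import Counter, defaultdict

# Hypothetical toy trips: (date, start_hour, day_type)
trips = [
    ("2019-08-05", 8, "Weekday"),
    ("2019-08-05", 8, "Weekday"),
    ("2019-08-06", 8, "Weekday"),
    ("2019-08-06", 17, "Weekday"),
    ("2019-08-10", 1, "Weekend"),
    ("2019-08-10", 14, "Weekend"),
]

# Total rentals for each (day type, hour) pair
rentals = Counter((day_type, hour) for _, hour, day_type in trips)

# Distinct dates observed for each day type
dates = defaultdict(set)
for date, _, day_type in trips:
    dates[day_type].add(date)

# Average rentals per hour = total rentals / number of days of that type
avg_per_hour = {
    key: count / len(dates[key[0]]) for key, count in rentals.items()
}

for (day_type, hour), avg in sorted(avg_per_hour.items()):
    print(f"{day_type:8s} {hour:02d}:00  {avg:.1f} rentals/hr")
```

Without dividing by the number of days, weekdays would look busier simply because there are more of them in the trial period.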

In [14]:
# Format avg rentals/hr for weekend, holiday and weekdays
weekend_by_hour = scooter[scooter['is_weekend'] == 1].groupby('start_hour').count().reset_index().iloc[:,0:2]
holiday_by_hour = scooter[scooter['is_holiday'] == 1].groupby('start_hour').count().reset_index().iloc[:,0:2]
weekday_by_hour = scooter[(scooter['is_holiday'] == 0) & 
                          (scooter['is_weekend'] == 0)].groupby('start_hour').count().reset_index().iloc[:,0:2]

num_weekends = len(scooter[scooter['is_weekend'] == 1].groupby('start_date').count())
num_holidays = len(scooter[scooter['is_holiday'] == 1].groupby('start_date').count())
num_weekdays = len(scooter[(scooter['is_holiday'] == 0) & (scooter['is_weekend'] == 0)].groupby('start_date').count())
weekday_by_hour['name'] = "Weekday"
holiday_by_hour['name'] = "Holiday"
weekend_by_hour['name'] = "Weekend"
weekday_by_hour['vehicle_type'] = weekday_by_hour['vehicle_type'] / num_weekdays
holiday_by_hour['vehicle_type'] = holiday_by_hour['vehicle_type'] / num_holidays
weekend_by_hour['vehicle_type'] = weekend_by_hour['vehicle_type'] / num_weekends
by_hour = pd.concat([weekday_by_hour, holiday_by_hour, weekend_by_hour])  # DataFrame.append was removed in pandas 2.0

line_fig4 = px.line(by_hour, x='start_hour', y='vehicle_type', color = 'name',
                    title='Scooter Rentals per Hour Based on Day Type',
                    labels={'name': 'Day Type', 'vehicle_type': 'Avg. Scooter Rentals/hr', 'start_hour': 'Time of Day'}
                    )
line_fig4.show()

I think the most interesting observations from this plot are:

  • The little spike in rentals around the morning rush hour, but only on weekdays
  • The increased rentals between midnight and 4am on weekends and holidays. More on this below.

Rentals by Time of Day and Day of Week

A heat map expands on the prior concept of time of the week impacting rentals. On weekdays we see more rentals around 8-9am vs on the weekends. We also see more rentals late night on Friday and Saturday evenings, versus weekday nights. I'm sure no one was "scooting" home from the bar...

In [15]:
week_order = {'start_day':['Monday','Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']}
px.density_heatmap(scooter, x='start_day', y='start_hour', category_orders=week_order, 
                   color_continuous_scale = 'plotly3', title='Rentals Based on Time of Day and Day of Week',
                   labels={'start_day':'Day of Week', 'start_hour':'Hour of Day'},
                   width=800, height=600)

Where are Trips Originating and Ending

We can view the starting and ending coordinates of each scooter rental. As mentioned previously, these are accurate only to within about 62m of the actual start and end points.

While the scooters start throughout the entire city, the rentals are concentrated downtown. There are even a few rentals in Chestermere, which is outside the city limits. Something to explore for future analysis.

In [16]:
# Plot starting point for all scooter trips
grid_count = scooter.groupby('starting_grid_id').count().reset_index().iloc[:,0:2]
grid_count.columns = ['starting_grid_id', 'rental_count']
# Select the numeric columns before averaging so non-numeric columns are never passed to mean()
grid_loc = (scooter.groupby('starting_grid_id')[['startx', 'starty', 'endx', 'endy',
                                                 'trip_duration', 'trip_distance']]
            .mean().reset_index())
grid_count = grid_count.merge(right=grid_loc, on='starting_grid_id')


px.scatter_mapbox(grid_count, lat='starty', lon='startx', color='rental_count',
                  zoom=9,  color_continuous_scale = 'plotly3',mapbox_style='dark',
                  width=800, height=800, title = 'All Trips: Originating Location',
                  labels = {'rental_count':'Total Rentals', 'starty': "Starting Latitude", 
                            'startx': "Starting Longitude"}
                  )
In [17]:
# Note: grid_count is grouped by starting grid, so each point is the mean end
# location of the trips that started in that cell
px.scatter_mapbox(grid_count, lat='endy', lon='endx', color='rental_count',
                  zoom=9,  color_continuous_scale = 'plotly3',mapbox_style='dark',
                  width=800, height=800, title = 'All Trips: Finishing Location',
                  labels = {'rental_count':'Total Rentals', 'endy': "Ending Latitude", 
                            'endx': "Ending Longitude"}
                  )

Starting and Ending Locations vs Date

The following animations show where trips are originating and terminating over the entire trial period. The odd scooter makes its way out of the core, but ultimately that's where most rentals originate and terminate.

In [18]:
grid_date = scooter.groupby(['start_date','starting_grid_id']).count().reset_index().iloc[:,0:3]
# Select the numeric columns before averaging so non-numeric columns are never passed to mean()
grid_loc = (scooter.groupby(['start_date', 'starting_grid_id'])[['startx', 'starty', 'endx', 'endy',
                                                                 'trip_duration', 'trip_distance']]
            .mean().reset_index())
grid_date = grid_date.merge(grid_loc, on=['start_date', 'starting_grid_id'])
grid_date.columns = ['start_date', 'starting_grid_id', 'rental_count', 'startx', 'starty', 'endx', 'endy',
                    'trip_duration', 'trip_distance']
grid_date['start_date'] = grid_date['start_date'].apply(lambda x: x.strftime("%d-%b-%Y"))

px.scatter_mapbox(grid_date, lat='starty', lon='startx', color='rental_count',
                  zoom=9,  color_continuous_scale = 'plotly3',mapbox_style='dark',
                  width=800, height=800, title = 'All Trips: Originating Location by Date',
                  labels = {'rental_count':'Total Rentals', 'starty': "Starting Latitude", 
                            'startx': "Starting Longitude", 'start_date':'Date',
                            'trip_duration': "Trip Time (s)", 'trip_distance': 'Trip Distance (m)'},
                  hover_data=['trip_duration', 'trip_distance'],
                  animation_frame = 'start_date'
                  )
In [19]:
px.scatter_mapbox(grid_date, lat='endy', lon='endx', color='rental_count',
                  zoom=9,  color_continuous_scale = 'plotly3',mapbox_style='dark',
                  width=800, height=800, title = 'All Trips: Final Location by Date',
                  labels = {'rental_count':'Total Rentals', 'endy': "Ending Latitude", 
                            'endx': "Ending Longitude", 'start_date':'Date',
                            'trip_duration': "Trip Time (s)", 'trip_distance': 'Trip Distance (m)'},
                  animation_frame = 'start_date',
                  hover_data=['trip_duration', 'trip_distance'],
                  )

Is This a Viable Business?

Another question that I had was how much money the scooters could possibly be making. While I don't have any insights into the business model, we can at least guess how much revenue is generated by the scooters.

Info on pricing wasn't available on Lime's website, but I found an article [8] that mentions \$1 for the first minute and \$0.30 per minute thereafter. The analysis below assumes that all trips followed this pricing and that all trips were paid in full, i.e. no discounts or promotions. This isn't going to be 100% accurate, but it's about the best I can do.

In [20]:
def scooter_revenue(trip_time):
    """Calculates scooter revenue ($) as a function of trip time assuming $1 to start and 
    $0.30/min thereafter"""
    return 0.3*((trip_time-1)//60) + 1
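
To make the pricing rule concrete, here is a small sketch that restates the same formula in whole cents (to sidestep floating-point rounding) and prices a few trip durations. The name `fare_cents` is only for illustration, and trip times are assumed to be whole seconds, as in the dataset:

```python
def fare_cents(trip_seconds):
    """Fare in cents: 100 for the first minute, plus 30 for each
    additional minute or part of a minute (same rule as above)."""
    extra_minutes = (trip_seconds - 1) // 60
    return 100 + 30 * extra_minutes

for seconds in (31, 60, 61, 503, 9521):
    print(f"{seconds:>5d} s -> ${fare_cents(seconds) / 100:.2f}")
```

Here 31 s, 503 s and 9521 s are the minimum, median and maximum durations from the all-trips summary table; they price out at \$1.00, \$3.40 and \$48.40 respectively.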
In [21]:
# Calc revenue for all scooter trips
scooter['trip_cost'] = scooter['trip_duration'].apply(scooter_revenue).values
In [22]:
hist_fig1 = px.histogram(scooter.sample(frac=0.05), x='trip_cost', histnorm='probability', marginal = 'box',
             title='Sample Distribution of Cost of Scooter Rentals (Fraction of Rentals)', nbins=50,
             labels={'count': 'Percent of Total Rentals', 'trip_cost':'Total Cost of Trip ($)'})
hist_fig1.show()

Summary Statistics for Trip Cost

In [23]:
print(scooter.trip_cost.describe())
count    464743.000000
mean          4.715451
std           4.062704
min           1.000000
25%           2.500000
50%           3.400000
75%           5.500000
max          48.400000
Name: trip_cost, dtype: float64

Total Projected Revenue:

In [24]:
print(scooter.trip_cost.sum())
2191472.8000000007

Probably as expected, the distribution of trip cost is right-skewed, with a median trip cost of \$3.40/trip and a mean cost of about \$4.72/trip.

Total estimated revenue was about \$2.19 million over a 3-month period! And as we saw above, they weren't even fully operational over those three months. I have no insights into the business model, but that's a lot more than I was expecting. I wonder if someone actually paid \$48 for a scooter trip!

Weather Impacts

One previously stated goal was to study the impact of weather on scooter rentals. It seems intuitive that the weather should affect the number of scooters rented, although you could probably predict rentals pretty well using just the time of day and day of the week (an exercise for future work).

The first chart compares, for each weather type, its share of rentals with its share of hours across the entire dataset. Blue is the fraction of all scooter rentals and red is the fraction of 'hours' that showed that weather type.

For instance, 43% of the time in the dataset it was clear, but 47% of scooters were rented when it was clear. Conversely 10% of the time it rained, but only 8% of rentals happened when raining.
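
One way to read these pairs of numbers is as a ratio of rental share to hour share. The sketch below uses only the two sets of percentages quoted above; the "Rain" label is a stand-in and may not match the exact weather category name in the dataset:

```python
# Shares quoted in the text above (fractions of all hours / all rentals).
# "Rain" is a placeholder label, not necessarily the dataset's category name.
weather_shares = {
    "Clear": {"hours": 0.43, "rentals": 0.47},
    "Rain": {"hours": 0.10, "rentals": 0.08},
}

# A ratio above 1 means the weather type gets more than its share of rentals;
# below 1 means fewer rentals than its share of hours would suggest.
ratios = {
    weather: share["rentals"] / share["hours"]
    for weather, share in weather_shares.items()
}

for weather, ratio in ratios.items():
    print(f"{weather}: rentals/hours ratio = {ratio:.2f}")
```

A ratio of roughly 1.1 for clear hours and 0.8 for rainy hours points in the expected direction, though it does not control for time of day.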

It didn't snow much over the trial period, but there were very few rentals when it did snow. Worth confirming, but it's possible that the scooters were removed from operation when it snowed in September.

In [25]:
scooter_count = scooter.groupby('Weather').count().reset_index().iloc[:,0:2]
scooter_count.columns = ['Weather', 'Rentals']
weather_count = scooter[['datetime','Weather']].drop_duplicates().groupby('Weather').count().reset_index()
weather_count.columns = ['Weather', 'Hours']
total_count = weather_count.merge(scooter_count, on="Weather")
total_count['Hours']  = total_count['Hours'] / total_count['Hours'].sum()
total_count['Rentals']  = total_count['Rentals'] / total_count['Rentals'].sum()
total_count = total_count.melt(id_vars = 'Weather', value_name = 'Percentage of Total', var_name = 'Category')

# Percentage of rentals with that weather vs percentage of hours with that weather
px.bar(total_count, x = 'Weather', y ='Percentage of Total', color = 'Category', barmode = 'group', opacity=1,
       title='Rentals vs Weather')

Comparing temperature to number of rentals, it looks like there are more rentals when it's warmer, but the data also clusters around time of day. i.e. it doesn't matter if it's 20$^{\circ}C$ at midnight, there won't be many rentals.

In [26]:
# Select the numeric columns before averaging so non-numeric columns are never passed to mean()
temperature_df = (scooter.groupby('datetime')[['Temp (°C)', 'Wind Spd (km/h)', 'start_hour']]
                  .mean().reset_index())
rentals_per_hr = scooter.groupby('datetime').count().reset_index().iloc[:,0:2]
rentals_per_hr.columns = ['datetime', 'count']
rentals_per_hr = rentals_per_hr.merge(temperature_df, on='datetime')
rentals_per_hr.datetime = rentals_per_hr.datetime.apply(lambda x: x.strftime("%d-%b-%Y"))
px.scatter(rentals_per_hr, y='count', x='Temp (°C)', color='start_hour', color_continuous_scale = 'plotly3',
           hover_data = ['datetime'], labels={'datetime':'Date'})

A more interesting plot shows just the rentals vs temperature at 4pm. Here there is more of a positive trend.

In [27]:
px.scatter(rentals_per_hr[rentals_per_hr.start_hour == 16], y='count', x='Temp (°C)',
           hover_data = ['datetime'], labels={'datetime':'Date'})

8am however looks more like random scatter. You probably don't care about temperature when deciding if you're riding a scooter to work. Note: Weekends aren't broken out here.

In [28]:
px.scatter(rentals_per_hr[rentals_per_hr.start_hour == 8], y='count', x='Temp (°C)',
           hover_data = ['datetime'], labels={'datetime':'Date'})

The weather's impact on rentals superficially looks as expected: more rentals when it's nice, and fewer when it's not. Don't expect the scooters to operate over the winter. More analysis could be done to actually quantify the weather's impact on rentals.

Rider Types

The last thing I wanted to do was investigate if I could, at a high level, attempt to classify the types of rides that are happening on the scooters.

I personally witnessed lots of people grabbing the scooters and more or less, "Taking them for a spin," with no real purpose in mind other than to try them out.

The city commissioned a survey and published that one in three trips replaced a car [9]. I'd like to see how plausible that is with the data.

When cleaning the data, I added a column for "aerial distance", which is basically the length of a straight line between the trip starting point and ending point. As mentioned previously, the coordinates are anonymized, so the actual start and end points could be up to ~62m from the points in the dataset. So, the actual aerial distance traveled is within about +/- 124m of the value in the dataset.
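
The cleaning code is not reproduced here, so the exact method behind the aerial distance column is not shown. As one plausible sketch, a straight-line distance between two lat/lon points can be approximated with the haversine formula. The function name, the Earth radius constant and the coordinates below are all assumptions made for illustration:

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; an assumption of this sketch

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# Hypothetical start and end points in downtown Calgary (not from the dataset)
start = (51.0450, -114.0715)
end = (51.0520, -114.0600)
distance = haversine_m(*start, *end)
print(f"aerial distance: {distance:.0f} m (+/- ~124 m from anonymization)")
```

For trips of a kilometre or more the ~124m uncertainty is small, but for very short trips it can dominate the measurement.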

I used Principal Component Analysis (PCA) on distance traveled, trip time and aerial distance to see if any interesting observations emerged.

In [29]:
scooter_sample = scooter.sample(frac=0.01) # Use sample so points are actually visible

# Columns for pca
pca_cols=['trip_distance', 'trip_duration', 'a_dist'] 

# Scale data and convert back to a DataFrame
scale = StandardScaler()
df_scaled = pd.DataFrame(scale.fit_transform(scooter_sample[pca_cols]), columns=pca_cols)

# Run PCA on the scaled feature set
pca = PCA(n_components=2)
principal_components = pca.fit_transform(df_scaled)

# Put the components into a DataFrame and rescale them to unit variance for plotting
df_pca = pd.DataFrame(StandardScaler().fit_transform(principal_components), columns=['pc1', 'pc2'])
In [30]:
# Plot using the principal components as axes
sns.lmplot(x='pc1', y='pc2', data=df_pca, fit_reg=False, height=8)

# Loadings of each original feature on the first two PCs;
# these set the direction of the arrow for each **original feature**
xvector = pca.components_[0]
yvector = pca.components_[1]

# Unscaled PC scores, used to scale the arrow lengths
scores = pca.transform(df_scaled)
xs = scores[:, 0]
ys = scores[:, 1]

# arrows project features (columns from csv) as vectors onto PC axes
for i in range(len(xvector)):
    plt.arrow(0, 0, xvector[i]*max(xs), yvector[i]*max(ys),
              color='r', width=0.005, head_width=0.05)
    plt.text(xvector[i]*max(xs)*1.1, yvector[i]*max(ys)*1.1,
             list(scooter_sample[pca_cols].columns.values)[i], color='r')

plt.annotate("Productive Trips", xy=(6,6)) 
plt.annotate('"Joy Rides!"', xy=(4,-4)) 
plt.title('PCA of Scooter Trip Data')
plt.show()

I call trips migrating towards the top right of this plot "Productive" trips, as the aerial distance increases as trip distance increases. Trips in the lower half of the chart I call "Joy Rides", as the trip duration and distance are increasing but the aerial distance is relatively low. This would represent a trip where someone started and ended at roughly the same place.

As expected, most trips are relatively short in duration and distance.

A metric "Trip Efficiency" is calculated as the ratio of aerial distance to measured trip distance. Theoretically the maximum for this metric should be 1, but due to the inaccuracy of the start and end point, sometimes it is >1. Also, in theory if the scooter was being carried, or on the train etc., this ratio could be >1.
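
As a worked example, the ratio can be recomputed from the distance columns of three rows in the sample table shown earlier. The helper name `travel_efficiency` mirrors the column name but is otherwise just for illustration:

```python
# (trip_distance in m, a_dist in m) for three rows of the sample table above
sample_trips = [
    (338, 62.040324),
    (1092, 62.040324),
    (2059, 1622.721148),
]

def travel_efficiency(aerial_m, trip_m):
    """Ratio of straight-line distance to distance actually travelled."""
    return aerial_m / trip_m

for trip_m, aerial_m in sample_trips:
    eff = travel_efficiency(aerial_m, trip_m)
    print(f"{trip_m:>5d} m trip -> efficiency {eff:.3f}")
```

These reproduce the `travel_efficiency` values in the table (about 0.184, 0.057 and 0.788). The first two trips started and ended in the same hexagon, which is why their efficiency is so low.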

Below is a histogram of trip efficiency for a sample of scooter rentals:

In [31]:
hist_fig2 = px.histogram(scooter.sample(frac=0.05), x='travel_efficiency', histnorm='probability', marginal = 'box',
                         title='Sample Distribution of Travel Efficiency of Scooter Rentals (Fraction of Rentals)', 
                         labels={'travel_efficiency':'Aerial Distance/Trip Distance'},
                         nbins=100)
hist_fig2.show()

We see that the most popular range for trip efficiency is in the 0.4-0.9 range, which is probably about as expected if you were actually using the scooter to go somewhere.

That said there are also a lot of trips with low travel efficiencies. I'd speculate these were more "just for fun" rides.

Classifying Trips

The original question was: What fraction of trips could plausibly have replaced a trip with a car?

I'll use some (made up) assumptions to decide if the trip could have replaced a car. We'll assume your average millennial (implicitly assuming that most renters are millennials) walks at 1.35m/s [10].

The qualifiers are:

1) Trip distance must be >810m, which is about a 10 min walk. Granted, some people take cars for shorter trips, but I'm mostly making this up.
2) Travel efficiency must be >0.3. If you're meandering more than that, I'm guessing you are probably just out for a ride.

That gives approximately half of trips that could plausibly have replaced a car, so the city's 1 in 3 figure seems plausible.
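
The two qualifiers can be written as a simple rule. The sketch below derives the 810m threshold from the assumed walking speed and applies the rule to the five rows of the sample table shown earlier; the function and constant names are illustrative only:

```python
WALKING_SPEED_M_S = 1.35   # assumed walking speed from the text
MAX_WALK_MINUTES = 10      # anything closer than this is assumed to be walked
MIN_EFFICIENCY = 0.3       # below this, the trip is treated as a joy ride

# 1.35 m/s * 600 s = 810 m
min_distance_m = WALKING_SPEED_M_S * MAX_WALK_MINUTES * 60

def could_replace_car(trip_distance_m, travel_efficiency):
    """Apply the two (admittedly made-up) qualifiers from the text."""
    return trip_distance_m > min_distance_m and travel_efficiency > MIN_EFFICIENCY

# (trip_distance, travel_efficiency) for the five rows of the sample table
sample = [(338, 0.183551), (1092, 0.056813), (2059, 0.788111),
          (158, 0.392660), (1009, 0.184479)]
flags = [could_replace_car(d, e) for d, e in sample]
print(f"threshold: {min_distance_m:.0f} m, qualifying trips: {sum(flags)} of {len(flags)}")
```

Only the 2059m trip passes both tests in this tiny sample; the others are either too short or too meandering.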

In [32]:
# Calculation for above
car_test = scooter[(scooter.trip_distance > 810) & 
                   (scooter.travel_efficiency>0.3)].start_date.count()/scooter.start_date.count()
print("Fraction of trips that could have replaced a car: ", round(car_test,3)*100, "%")
Fraction of trips that could have replaced a car:  49.9 %

If you don't like my assumptions for minimum distance and efficiency threshold, feel free to use the following chart to look up the fraction of trips that could have replaced a car, based on your own assumptions.

In [33]:
# do it on a range of inputs

lazy_threshold = np.arange(100, 1600, 100).tolist()
car_trips = []
travel_eff = np.arange(.1, 1, 0.1).tolist()


for thresh in lazy_threshold:
    for eff in travel_eff:
        car_trips.append(scooter[(scooter.trip_distance > thresh) & 
                            (scooter.travel_efficiency>eff)].start_date.count()/scooter.start_date.count())
        
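# Name the columns explicitly so the axis labels below map onto them
car_df = pd.DataFrame({'min_efficiency': travel_eff * len(lazy_threshold),
                       'min_distance': [item for item in lazy_threshold for _ in travel_eff],
                       'fraction': car_trips})

line_fig5 = px.line(car_df, x='min_efficiency', y='fraction', animation_frame='min_distance',
                    title="Fraction of Trips That Could Replace a Car",
                    labels={'min_efficiency': "Minimum Travel Efficiency",
                            'min_distance': "Minimum Distance Threshold (m)",
                            'fraction': "Fraction of Trips"})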
line_fig5.show()

Bonus Analysis

According to the City of Calgary's website, the maximum speed of the scooters is 20km/h [2].

Looking at the data it appears that some people were able to achieve higher average speeds in practice.
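
For reference, the average speed of a trip is just distance over duration, converted from m/s to km/h. The sketch below recomputes it for two rows of the sample table and for one made-up fast trip; the function name and the third trip are hypothetical:

```python
SPEED_LIMIT_KMH = 20  # maximum scooter speed quoted above

def average_speed_kmh(distance_m, duration_s):
    """Average speed over the whole trip, converted from m/s to km/h."""
    return distance_m / duration_s * 3.6

# (trip_distance in m, trip_duration in s): two rows from the sample table,
# plus one hypothetical fast trip for illustration
trips = [(338, 129), (2059, 547), (3000, 480)]

for distance_m, duration_s in trips:
    speed = average_speed_kmh(distance_m, duration_s)
    note = "above limit" if speed > SPEED_LIMIT_KMH else "within limit"
    print(f"{distance_m} m in {duration_s} s -> {speed:.1f} km/h ({note})")
```

Because this is an average over the full rental, including any time spent stopped, an average above 20km/h suggests either a higher top speed or some error in the recorded distance or duration.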

In [34]:
hist_fig3 = px.histogram(scooter.sample(frac=0.05), x='speed', nbins=100, histnorm='probability', marginal = 'box',
                         title='Sample Distribution of Average Scooter Speed (Full Trip)', 
                         labels={'count': 'Percent of Total Rentals', 'speed':'Average Speed (km/h)'},
                         range_x=(0,40))
hist_fig3.show()

Conclusions

It's hard to argue that the e-scooters weren't popular in Calgary. It's no surprise that they will continue next year, while e-bike rentals will not return (at least for Lime).

Rentals cover much of the city with most in the central downtown area.

We don't know all the details of the business model, but the revenue potential is certainly there. While there were certainly many novelty rides, it does look like people were using the scooters to actually travel places. This bodes well for the sustainability of the business model.
